mutation score
LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework
Lops, Andrea, Narducci, Fedelucio, Ragone, Azzurra, Trizio, Michelantonio, Bartolini, Claudio
Unit testing is an essential but resource-intensive step in software development, ensuring individual code units function correctly. This paper introduces AgoneTest, an automated evaluation framework for Large Language Model-generated (LLM) unit tests in Java. AgoneTest does not aim to propose a novel test generation algorithm; rather, it supports researchers and developers in comparing different LLMs and prompting strategies through a standardized end-to-end evaluation pipeline under realistic conditions. We introduce the Classes2Test dataset, which maps Java classes under test to their corresponding test classes, and a framework that integrates advanced evaluation metrics, such as mutation score and test smells, for a comprehensive assessment. Experimental results show that, for the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection. Our findings also demonstrate that enhanced prompting strategies contribute to test quality. AgoneTest clarifies the potential of LLMs in software testing and offers insights for future improvements in model design, prompt engineering, and testing practices.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Italy > Apulia > Bari (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (17 more...)
Mutation Testing for Industrial Robotic Systems
Santos, Marcela Gonçalves dos, Hallé, Sylvain, Petrillo, Fábio
Industrial robotic systems (IRS) are increasingly deployed in diverse environments, where failures can result in severe accidents and costly downtime. Ensuring the reliability of the software controlling these systems is therefore critical. Mutation testing, a technique widely used in software engineering, evaluates the effectiveness of test suites by introducing small faults, or mutants, into the code. However, traditional mutation operators are poorly suited to robotic programs, which involve message-based commands and interactions with the physical world. This paper explores the adaptation of mutation testing to IRS by defining domain-specific mutation operators that capture the semantics of robot actions and sensor readings. We propose a methodology for generating meaningful mutants at the level of high-level read and write operations, including movement, gripper actions, and sensor noise injection. An empirical study on a pick-and-place scenario demonstrates that our approach produces more informative mutants and reduces the number of invalid or equivalent cases compared to conventional operators. Results highlight the potential of mutation testing to enhance test suite quality and contribute to safer, more reliable industrial robotic systems.
- South America > Uruguay > Maldonado > Maldonado (0.04)
- South America > Brazil > Paraná > Curitiba (0.04)
- Oceania > Australia > Queensland > Brisbane (0.04)
- (9 more...)
Combining TSL and LLM to Automate REST API Testing: A Comparative Study
Barradas, Thiago, Paes, Aline, Neves, Vânia de Oliveira
The effective execution of tests for REST APIs remains a considerable challenge for development teams, driven by the inherent complexity of distributed systems, the multitude of possible scenarios, and the limited time available for test design. Exhaustive testing of all input combinations is impractical, often resulting in undetected failures, high manual effort, and limited test coverage. To address these issues, we introduce RestTSLLM, an approach that uses Test Specification Language (TSL) in conjunction with Large Language Models (LLMs) to automate the generation of test cases for REST APIs. The approach targets two core challenges: the creation of test scenarios and the definition of appropriate input data. The proposed solution integrates prompt engineering techniques with an automated pipeline to evaluate various LLMs on their ability to generate tests from OpenAPI specifications. The evaluation focused on metrics such as success rate, test coverage, and mutation score, enabling a systematic comparison of model performance. The results indicate that the best-performing LLMs - Claude 3.5 Sonnet (Anthropic), Deepseek R1 (Deepseek), Qwen 2.5 32b (Alibaba), and Sabia 3 (Maritaca) - consistently produced robust and contextually coherent REST API tests. Among them, Claude 3.5 Sonnet outperformed all other models across every metric, emerging in this study as the most suitable model for this task. These findings highlight the potential of LLMs to automate the generation of tests based on API specifications.
- South America > Brazil > Pernambuco > Recife (0.05)
- North America > United States (0.04)
- South America > Brazil > Rio de Janeiro > Niterói (0.04)
- (6 more...)
Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models
Walczak, Jakub, Tomalak, Piotr, Laskowski, Artur
--Generative AI is gaining increasing attention in software engineering, where testing remains an indispensable reliability mechanism. According to the widely adopted testing pyramid, unit tests constitute the majority of test cases and are often schematic, requiring minimal domain expertise. Automatically generating such tests under the supervision of software engineers can significantly enhance productivity during the development phase of the software lifecycle. This paper investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests generated by various large language models (LLMs) across several families. The results show that including docstrings notably improves code adequacy, while further extending context to the full implementation yields definitely smaller gains. Notably, the chain-of-thought prompting strategy -- applied even to'reasoning' models -- achieves the best results, with up to 96.3% branch coverage, a 57% average mutation score, and near-perfect compilation success rate. Among the evaluated models, M5 (Gemini 2.5 Pro) demonstrated superior performance in both mutation score and branch coverage being still in top in terms of compilation success rate. ECENT years have brought significant advancements in artificial intelligence (AI), particularly in the areas of performance and productivity enhancement. However, AI -- and particularly large language models (LLMs) -- still suffer from several weaknesses. Among them, convincing but senseless content generation ('hallucination'), safety misalignment ('ethicality') [1], unfairness [2], and limited processing context are the most critical. In spite of these restrictions, and bearing in mind the limited and merely apparent creativity of LLMs [3], they have become versatile tools already widely used across a variety of domains (creative industries [4], entertainment, reporting, and software engineering [5] are just cases in point) for multiple tasks.
- Research Report > New Finding (1.00)
- Overview (1.00)
On Accelerating Deep Neural Network Mutation Analysis by Neuron and Mutant Clustering
Mutation analysis of deep neural networks (DNNs) is a promising method for effective evaluation of test data quality and model robustness, but it can be computationally expensive, especially for large models. To alleviate this, we present DEEPMAACC, a technique and a tool that speeds up DNN mutation analysis through neuron and mutant clustering. DEEPMAACC implements two methods: (1) neuron clustering to reduce the number of generated mutants and (2) mutant clustering to reduce the number of mutants to be tested by selecting representative mutants for testing. Both use hierarchical agglomerative clustering to group neurons and mutants with similar weights, with the goal of improving efficiency while maintaining mutation score. DEEPMAACC has been evaluated on 8 DNN models across 4 popular classification datasets and two DNN architectures. When compared to exhaustive, or vanilla, mutation analysis, the results provide empirical evidence that neuron clustering approach, on average, accelerates mutation analysis by 69.77%, with an average -26.84% error in mutation score. Meanwhile, mutant clustering approach, on average, accelerates mutation analysis by 35.31%, with an average 1.96% error in mutation score. Our results demonstrate that a trade-off can be made between mutation testing speed and mutation score error.
- North America > United States > Massachusetts > Suffolk County > Boston (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Alabama > Lee County > Auburn (0.04)
- (6 more...)
VALTEST: Automated Validation of Language Model Generated Test Cases
Taherkhani, Hamed, Hemmati, Hadi
Large Language Models (LLMs) have demonstrated significant potential in automating software testing, specifically in generating unit test cases. However, the validation of LLM-generated test cases remains a challenge, particularly when the ground truth is unavailable. This paper introduces VALTEST, a novel framework designed to automatically validate test cases generated by LLMs by leveraging token probabilities. We evaluate VALTEST using nine test suites generated from three datasets (HumanEval, MBPP, and LeetCode) across three LLMs (GPT-4o, GPT-3.5-turbo, and LLama3.1 8b). By extracting statistical features from token probabilities, we train a machine learning model to predict test case validity. VALTEST increases the validity rate of test cases by 6.2% to 24%, depending on the dataset and LLM. Our results suggest that token probabilities are reliable indicators for distinguishing between valid and invalid test cases, which provides a robust solution for improving the correctness of LLM-generated test cases in software testing. In addition, we found that replacing the identified invalid test cases by VALTEST, using a Chain-of-Thought prompting results in a more effective test suite while keeping the high validity rates.
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Estonia > Harju County > Tallinn (0.04)
MILE: A Mutation Testing Framework of In-Context Learning Systems
Wei, Zeming, Zhang, Yihao, Sun, Meng
In-context Learning (ICL) has achieved notable success in the applications of large language models (LLMs). By adding only a few input-output pairs that demonstrate a new task, the LLM can efficiently learn the task during inference without modifying the model parameters. Such mysterious ability of LLMs has attracted great research interests in understanding, formatting, and improving the in-context demonstrations, while still suffering from drawbacks like black-box mechanisms and sensitivity against the selection of examples. In this work, inspired by the foundations of adopting testing techniques in machine learning (ML) systems, we propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems. First, we propose several mutation operators specialized for ICL demonstrations, as well as corresponding mutation scores for ICL test sets. With comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites. Our code is available at https://github.com/weizeming/MILE.
Automated Test Case Generation Using Code Models and Domain Adaptation
Hashtroudi, Sepehr, Shin, Jiho, Hemmati, Hadi, Wang, Song
State-of-the-art automated test generation techniques, such as search-based testing, are usually ignorant about what a developer would create as a test case. Therefore, they typically create tests that are not human-readable and may not necessarily detect all types of complex bugs developer-written tests would do. In this study, we leverage Transformer-based code models to generate unit tests that can complement search-based test generation. Specifically, we use CodeT5, i.e., a state-of-the-art large code model, and fine-tune it on the test generation downstream task. For our analysis, we use the Methods2test dataset for fine-tuning CodeT5 and Defects4j for project-level domain adaptation and evaluation. The main contribution of this study is proposing a fully automated testing framework that leverages developer-written tests and available code models to generate compilable, human-readable unit tests. Results show that our approach can generate new test cases that cover lines that were not covered by developer-written tests. Using domain adaptation, we can also increase line coverage of the model-generated unit tests by 49.9% and 54% in terms of mean and median (compared to the model without domain adaptation). We can also use our framework as a complementary solution alongside common search-based methods to increase the overall coverage with mean and median of 25.3% and 6.3%. It can also increase the mutation score of search-based methods by killing extra mutants (up to 64 new mutants were killed per project in our experiments).
- North America > Canada > Ontario > Toronto (0.14)
- North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
DeepMetis: Augmenting a Deep Learning Test Set to Increase its Mutation Score
Riccio, Vincenzo, Humbatova, Nargiz, Jahangirova, Gunel, Tonella, Paolo
Deep Learning (DL) components are routinely integrated into software systems that need to perform complex tasks such as image or natural language processing. The adequacy of the test data used to test such systems can be assessed by their ability to expose artificially injected faults (mutations) that simulate real DL faults. In this paper, we describe an approach to automatically generate new test inputs that can be used to augment the existing test set so that its capability to detect DL mutations increases. Our tool DeepMetis implements a search based input generation strategy. To account for the non-determinism of the training and the mutation processes, our fitness function involves multiple instances of the DL model under test. Experimental results show that \tool is effective at augmenting the given test set, increasing its capability to detect mutants by 63% on average. A leave-one-out experiment shows that the augmented test set is capable of exposing unseen mutants, which simulate the occurrence of yet undetected faults.
- North America > United States > New York > New York County > New York City (0.05)
- Oceania > Australia (0.04)
- North America > United States > Tennessee > Shelby County > Memphis (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Sentinel: A Hyper-Heuristic for the Generation of Mutant Reduction Strategies
Guizzo, Giovani, Sarro, Federica, Krinke, Jens, Vergilio, Silvia Regina
Mutation testing is an effective approach to evaluate and strengthen software test suites, but its adoption is currently limited by the mutants' execution computational cost. Several strategies have been proposed to reduce this cost (a.k.a. mutation cost reduction strategies), however none of them has proven to be effective for all scenarios since they often need an ad-hoc manual selection and configuration depending on the software under test (SUT). In this paper, we propose a novel multi-objective evolutionary hyper-heuristic approach, dubbed Sentinel, to automate the generation of optimal cost reduction strategies for every new SUT. We evaluate Sentinel by carrying out a thorough empirical study involving 40 releases of 10 open-source real-world software systems and both baseline and state-of-the-art strategies as a benchmark. We execute a total of 4,800 experiments, and evaluate their results with both quality indicators and statistical significance tests, following the most recent best practice in the literature. The results show that strategies generated by Sentinel outperform the baseline strategies in 95% of the cases always with large effect sizes. They also obtain statistically significantly better results than state-of-the-art strategies in 88% of the cases, with large effect sizes for 95% of them. Also, our study reveals that the mutation strategies generated by Sentinel for a given software version can be used without any loss in quality for subsequently developed versions in 95% of the cases. These results show that Sentinel is able to automatically generate mutation strategies that reduce mutation testing cost without affecting its testing effectiveness (i.e. mutation score), thus taking off from the tester's shoulders the burden of manually selecting and configuring strategies for each SUT.
- Europe > United Kingdom (0.14)
- South America > Uruguay > Maldonado > Maldonado (0.04)
- South America > Brazil > Paraná > Curitiba (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)